This reading list will help you to prepare for each session of the module. In addition to outlining the required, highly recommended, and recommended reading for each topic, there is also a more general breakdown of suggested resources, particularly those related to learning how to do quantitative social science using R.
Where possible, I have added stars to indicate how accessible I believe a given textbook is. One star (*) means the text should be accessible for most people without any prior reading; two stars (**) means that the text should be mostly accessible, but may require some intensive study or prior reading; and three stars (***) means that the text is more complex, technical, or specialist than most, and might require working up to, cross-referencing terms with simpler texts, or going over multiple times.
This module has three texts that are highly recommended and are used for much of the preparatory reading. However, for each week I have also supplied additional or alternative reading, much of which is from textbooks that are available for free online. You should not feel constrained by these textbooks. You should also not feel like you need to complete all of the additional reading for each topic; my personal recommendation is to read one text for each week and then read further if you are still struggling with the topic or find it particularly interesting, or if the chosen reading did not work well for you. Preparatory reading is a guide for your independent learning, not a rule.
With that said, I have tried to base most preparatory and additional/alternative reading on the following three texts:
Rowntree
This is a classic text that covers statistical theory without a large amount of focus on the underlying mathematics. It is a very short book and does not complicate the subject matter by introducing other things, like exercises in a programming language or piece of software, at the same time. Available on bookshop.org (£9.29); older editions are available for free on archive.org.
Fogarty (2019), Quantitative Social Science Data with R (Sage)
This is an applied quantitative analysis textbook with real-world examples in R. The textbook covers a lot more of the basics and fundamentals than some other textbooks on the subject and largely follows the structure of this course and the Rowntree book. Reading both Rowntree and Fogarty side-by-side will help you cover much of the module content in depth, including the skills needed for assessment one (multiple linear regression). Can be ordered from SAGE with 30% off using the SAGE30 promotional code (£34.99, or £24.49 with the discount code).
Imai (2017), Quantitative Social Science: An Introduction (Princeton University Press)
This is a more intermediate text which goes into additional depth on many topics, particularly on probability, causality, and inference. It also covers additional important analytical methods, including spatial analysis, social network analysis, and textual analysis. This textbook will help you deepen what you learn from the two more accessible texts above and will provide some of the skills you may need for assessment two (cluster analysis, spatial analysis). If you already have some knowledge of quantitative research, you might want to use equivalent chapters in this textbook rather than Fogarty (2019) to develop your learning further. Online copies are freely accessible through the library’s StarPlus service or available from the publisher (£40, or £28.00 with the ABS21 or P239 discount code).
This section outlines the required preparatory reading for each week and its purpose. You will also find additional reading that you might find interesting or useful.
Sometimes you will come across topics or concepts in a different order from how we cover them in class. Don’t worry about this too much! Textbook authors and teachers have different pedagogical approaches, which means they introduce topics in different orders. If something you read about hasn’t been explained in class, try not to panic: it will either be covered later or is not essential.
You should not feel restricted to only the texts in this list and I would encourage you to use your own initiative to find ways to work in R that work for you. There are large numbers of online tutorials, bootcamps, MOOCs, documentation, blogs, and other forms of content that can help you achieve the same goals in R using different tools. Don’t be afraid to google a topic!
The purpose of this week is to get you thinking about the nature of quantitative social science (what it posits about the way the social/human world works, its epistemological and ontological underpinning), what kinds of questions it might be useful for answering, and how it can be used. I would encourage you to regularly visit literature like this which will assist you in thinking critically about research methods and methodology.
Walter, M. (2006). The nature of social science research. Social research methods: An Australian perspective, 1-28.*
Powell, T. C. (2020). Can quantitative research solve social problems? Pragmatism and the ethics of social research. Journal of Business Ethics, 167(1), 41-48. https://doi.org/10.1007/s10551-019-04196-7 **
On the philosophy of science and quantitative social research:
Additional reading for getting started with R (setting up and basic interaction with the console and scripts):
R. ‘Quantitative Social Science: An introduction’. Princeton University Press. **
or
R and RStudio. Quantitative Social Science Data with R. Sage. *
or
or
R and RStudio. Data Science in Education Using R. Routledge. (Free: https://datascienceineducation.com) *
Before we can use any kind of quantitative methods of analysis, we need to learn how we can actually quantify things. While this might sound extremely obvious, it is incredibly important to know the different types of quantification that exist as these dictate what kinds of data visualisation, summary statistics, and statistical tests and models we should use. They also allow us to check for patterns in our data that might violate some of the assumptions these tests, summary statistics, and models use.
By the end of this week, you should know the difference between continuous, ordinal, and categorical variables as well as appropriate summary statistics and visualisations for each and how to produce them using R.
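As a minimal sketch of what matching summaries to variable types can look like in base R, the example below uses the built-in `mtcars` dataset. The dataset and the choice of which variables to treat as continuous, ordinal, and categorical are assumptions made purely for illustration; they do not come from the module readings.

```r
# Illustrative only: mtcars ships with R and is not a module dataset.
cars <- mtcars

# Continuous variable (miles per gallon): mean, median, standard deviation
mpg_mean   <- mean(cars$mpg)
mpg_median <- median(cars$mpg)
mpg_sd     <- sd(cars$mpg)
hist(cars$mpg, main = "Miles per gallon", xlab = "mpg")

# Ordinal variable (number of cylinders, treated here as ordered categories):
# frequency counts and the median category
cyl_ord    <- factor(cars$cyl, levels = c(4, 6, 8), ordered = TRUE)
cyl_counts <- table(cyl_ord)
cyl_median <- levels(cyl_ord)[median(as.integer(cyl_ord))]
barplot(cyl_counts, main = "Cylinders", xlab = "Number of cylinders")

# Categorical (nominal) variable (transmission type): counts and proportions
transmission <- factor(cars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
am_props <- prop.table(table(transmission))
barplot(am_props, main = "Transmission", ylab = "Proportion")
```

A mean would not be meaningful for the transmission variable, and a histogram would not suit the cylinder categories, which is why the variable type is checked before a summary or plot is chosen.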
Additional reading that applies these concepts in R:
R. Sage. *
R. Sage. *
R. Sage. *
Now that we’ve learned how to quantify single variables and inspect their variation in R depending on their type, we can explore how to inspect relationships between two variables. By the end of this week, you should be able to demonstrate you know which kinds of visualisations and descriptive statistics to use to explore the relationships between different types of variable in your data.
This can include scatterplots and correlation statistics for two continuous/ordinal variables; contingency tables and heatmaps for two categorical variables; and mean differences and boxplots or ridge plots for an ordinal/categorical variable paired with an ordinal/continuous variable.
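The three pairings described above can be sketched in base R as follows. This is an illustration under stated assumptions: it uses the built-in `mtcars` data rather than a module dataset, and base graphics rather than ggplot, so that it runs without installing anything.

```r
# Illustrative only: mtcars ships with R and is not a module dataset.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Two continuous variables: scatterplot and Pearson correlation
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
r_mpg_wt <- cor(cars$mpg, cars$wt)

# Two categorical variables: contingency table of counts
tab <- table(cylinders = cars$cyl, transmission = cars$transmission)

# One categorical and one continuous variable: group means and a boxplot
group_means <- tapply(cars$mpg, cars$transmission, mean)
mean_diff   <- group_means[["manual"]] - group_means[["automatic"]]
boxplot(mpg ~ transmission, data = cars, ylab = "Miles per gallon")
```

Printing `tab`, `r_mpg_wt`, and `group_means` at the console shows the descriptive statistics that accompany each plot.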
R. Sage. *
R. Sage. *
Additional examples in R for data visualisation in ggplot:
More in depth treatment of the statistical theory behind some forms of comparison:
Often, an important goal for quantitative social scientists is being able to make generalisable claims about the patterns that exist in their data: that they apply to the entire population of interest and not just their specific sample. The most common way we achieve this is through inferential statistics and statistical hypothesis testing.
By the end of this week, you should be able to explain the logic behind statistical testing, how tests relate to specific hypotheses, and how to interpret a p-value. You should also have a working knowledge of what kinds of samples and survey methodologies can lead to population inference. Between this week and the following week, you will have practised running and interpreting the results of some bivariate statistical tests in R.
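To make the link between a hypothesis, a test, and a p-value concrete, here is a small sketch of one bivariate test in base R. It assumes the built-in `mtcars` data (not a module dataset) and a conventional 0.05 significance level. Because `mtcars` is not a random sample from any population, the example shows the mechanics of the test only, not a defensible population inference.

```r
# Illustrative only: mtcars ships with R and is not a random sample.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Null hypothesis: mean mpg is the same for automatic and manual cars.
# Alternative hypothesis: the two means differ.
# t.test() runs Welch's two-sample t-test by default (unequal variances).
test_result <- t.test(mpg ~ transmission, data = cars)

t_stat  <- unname(test_result$statistic)
p_value <- test_result$p.value

# The p-value is the probability of a difference at least this large
# if the null hypothesis were true; it is not the probability that the
# null hypothesis is true.
reject_null <- p_value < 0.05
print(test_result)
```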
Alternative treatment of statistical theory:
Inferential statistics in R
R. Sage. *
R. Sage. *
or
or
or
We should now know when it is appropriate to generalise our findings to a wider population depending on how our data has been collected, but another major claim we might often want to make about quantitative research is whether we can argue that a relationship is causal or not.
As with inferential statistics, there are a number of conditions that need to be satisfied for different strengths of causal evidence. By the end of this week, you should be able to rate the degree to which causality can be inferred based on study design.
and
Week 6 is a reading week. My recommendation is for you to use this week to revisit the preparatory reading, or engage with the additional reading, for a topic from the previous weeks that you still find challenging. My suggestion below, for example, is to revisit some of the statistical theory literature on the topic of inference, which many students tend to struggle with.
If you feel like you have a very good understanding of everything covered so far, I would recommend you either (a) engage with a practical learning resource to develop and reinforce your R skills, such as the #TidyTuesday project or (b) read ahead on the topic of regression.
Now that we have covered the bulk of the prerequisite statistical theory and gained familiarity with working in R, we can move on to the workhorse of contemporary quantitative social science: regression. Regression may seem daunting at first, but once you become familiar with its core concepts you will be able to easily run and interpret regression models and see how they share features with other types of statistical analysis.
By the end of this week you should be able to explain how linear regression works and interpret the output of a bivariate linear regression of a normally distributed dependent variable on a continuous independent variable. Further, you will be able to run a model like this in R.
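A bivariate linear regression of this kind can be sketched in a few lines of base R. The example assumes the built-in `mtcars` data, with miles per gallon as the dependent variable and weight as the independent variable; both choices are for illustration only.

```r
# Illustrative only: mtcars ships with R and is not a module dataset.
fit_bivariate <- lm(mpg ~ wt, data = mtcars)

# Coefficient table, standard errors, p-values, and R-squared
summary(fit_bivariate)

intercept <- coef(fit_bivariate)[["(Intercept)"]]  # expected mpg when wt = 0
slope     <- coef(fit_bivariate)[["wt"]]           # change in mpg per unit of wt
r_squared <- summary(fit_bivariate)$r.squared

# Predicted mpg for a hypothetical car weighing 3 (thousand lbs)
predicted <- predict(fit_bivariate, newdata = data.frame(wt = 3))
```

The slope is read as the expected change in the dependent variable for a one-unit increase in the independent variable, so a negative slope here means heavier cars are expected to travel fewer miles per gallon.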
Any of the preparatory or additional readings from Week 8: Multiple linear regression.
In week 8, we will take what we learned about linear regression and extend our regression models to include multiple predictors of an outcome. This will illustrate how powerful regression models can be for social science, especially where there may be confounding variables to control for post hoc.
By the end of this week, you will be able to include multiple independent variables in regression models in R, including categorical variables, and interpret the output.
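Extending a model to several predictors, one of them categorical, might look like the sketch below. It again assumes the built-in `mtcars` data, and the choice of predictors is arbitrary and illustrative rather than a recommended model specification.

```r
# Illustrative only: mtcars ships with R and is not a module dataset.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Two continuous predictors plus one categorical predictor.
# R dummy-codes the factor automatically; "automatic" (the first level)
# becomes the reference category.
fit_multiple <- lm(mpg ~ wt + hp + transmission, data = cars)
summary(fit_multiple)

coefs <- coef(fit_multiple)

# For comparison: the bivariate model with weight only
fit_bivariate <- lm(mpg ~ wt, data = cars)
r2_bivariate  <- summary(fit_bivariate)$r.squared
r2_multiple   <- summary(fit_multiple)$r.squared
```

The coefficient named `transmissionmanual` is the expected difference in mpg between manual and automatic cars, holding weight and horsepower constant. Each continuous coefficient is interpreted the same way: the expected change per unit, with the other predictors held fixed.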
R. Sage. *
R. Sage. *
or
Imai, K. (2017) Chapter 4.2: Linear regression ‘Quantitative Social Science: An introduction’. Princeton University Press.**
Imai, K. (2017) Chapter 4.3.2: Regression with multiple predictors ‘Quantitative Social Science: An introduction’. Princeton University Press.**
or
As we will have seen by week 9, multiple linear regression is flexible enough to answer many research questions by including multiple predictors. But what if we need more flexibility in the kind of outcome variable we are interested in? In week 9 we will look at logistic regression, a type of Generalised Linear Model (GLM) for predicting a binary categorical outcome.
By the end of this week you will be able to run and interpret the output of a logistic regression model in R.
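As a rough sketch, a logistic regression in base R uses `glm()` with a binomial family. The example assumes the built-in `mtcars` data and treats transmission type (0 = automatic, 1 = manual) as the binary outcome, which is an illustrative choice rather than a substantive research question.

```r
# Illustrative only: mtcars ships with R and is not a module dataset.
# Outcome: am (0 = automatic, 1 = manual); predictor: weight.
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_logit)

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
log_odds   <- coef(fit_logit)
odds_ratio <- exp(log_odds[["wt"]])

# Predicted probabilities of a manual transmission for each car
pred_prob <- predict(fit_logit, type = "response")

# Predicted probability for a hypothetical car with wt = 3
prob_wt3 <- predict(fit_logit, newdata = data.frame(wt = 3),
                    type = "response")
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases. In this model it indicates that heavier cars have lower odds of having a manual transmission.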
Lottes, I. L., DeMaris, A., & Adler, M. A. (1996). Using and interpreting logistic regression: A guide for teachers and students. Teaching Sociology, 284-298. https://doi.org/10.2307/1318743 **
Dalpiaz, D. (2021). Chapter 17: Logistic regression. Applied Statistics with R (Free online: http://daviddalpiaz.github.io/appliedstats/) ***
The second to last topic we will touch on is cluster analysis. There are many statistical and data learning methods that sit outside of the regression framework, especially those associated with machine learning, that are increasingly being used in the social sciences. Cluster analysis sits in one such area within a wider collection of methods under the umbrella of ‘unsupervised machine learning’. It can be useful for identifying underlying ‘groups’ of observations based on their characteristics.
By the end of this week, you should be able to run two relatively simple algorithms for clustering observations by their features: k-means and agglomerative hierarchical clustering.
Note: Cluster Analysis should not be confused with Factor Analysis. Cluster analysis attempts to find clusters of observations while factor analysis tries to reduce the number of dimensions in data by identifying a smaller number of latent factors associated with large groups of variables.
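Both algorithms are available in base R, and a minimal sketch might look like the following. It assumes the built-in `iris` measurements as the data, three clusters, standardised features, and Ward linkage for the hierarchical method. All of these are illustrative choices, not settings prescribed by the module.

```r
# Illustrative only: iris ships with R and is not a module dataset.
# Standardise the four numeric features so that none dominates the distances.
features <- scale(iris[, 1:4])

# k-means: results depend on random starting centres, so set a seed
# and use several random starts.
set.seed(123)
km <- kmeans(features, centers = 3, nstart = 25)
km_groups <- km$cluster

# Agglomerative hierarchical clustering on Euclidean distances
hc <- hclust(dist(features), method = "ward.D2")
plot(hc, labels = FALSE, main = "Dendrogram")
hc_groups <- cutree(hc, k = 3)  # cut the tree into three groups

# Cross-tabulate the two solutions to see how far they agree
agreement <- table(kmeans = km_groups, hierarchical = hc_groups)
print(agreement)
```

The cluster numbers themselves are arbitrary labels, so agreement between the two methods shows up as each row of the table being concentrated in a single column rather than as matching numbers.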
or
R Programming Guide, ‘k-Means Cluster Analysis’ (https://uc-r.github.io/kmeans_clusterin) *
or
and
R Programming Guide, ‘Hierarchical Cluster Analysis’ (https://uc-r.github.io/hc_clustering) *
or
R. Routledge.
R. Second Edition. Springer. (Available for free here: https://www.statlearning.com) **
When we do quantitative social research it’s not uncommon for us to unthinkingly abstract our data from its spatial context, despite the impact that space and place have on lives. Spatial data can lead to critical new insights about segregation or integration, or can create powerful policy messages or strategies.
By the end of this final week, you will be able to plot spatial data in the form of choropleth maps and with data points to identify patterns using R. You will also learn a basic measure of spatial autocorrelation, Moran’s I.
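To show what Moran’s I measures, the sketch below computes it by hand in base R on a toy 4 × 4 grid. Everything here is a made-up assumption for illustration: the grid, the two value patterns, the "rook" definition of neighbours (cells sharing an edge), and the helper function `morans_i()`, which is hypothetical and not taken from any package. Real analyses would normally build the neighbour weights from actual geographic boundaries.

```r
# Hypothetical helper: Moran's I for values x and a spatial weights matrix w
# I = (n / sum(w)) * sum_ij( w_ij * z_i * z_j ) / sum_i( z_i^2 ),
# where z is x centred on its mean.
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy 4 x 4 grid of areas
cells <- expand.grid(row = 1:4, col = 1:4)

# Rook neighbours: cells exactly one step apart horizontally or vertically
step_distance <- as.matrix(dist(cells, method = "manhattan"))
w <- (step_distance == 1) * 1

# Pattern 1: a smooth gradient, so neighbouring cells have similar values
gradient   <- cells$row + cells$col
i_gradient <- morans_i(gradient, w)

# Pattern 2: a checkerboard, so every neighbour has the opposite value
checker   <- (cells$row + cells$col) %% 2
i_checker <- morans_i(checker, w)
```

The gradient pattern gives a positive value (similar values cluster together), while the checkerboard gives a value of -1 (neighbours are as dissimilar as possible). Values near zero would suggest no spatial pattern.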
sf: Simple Features for R. https://r-spatial.github.io/sf/index.html (See articles pages for walkthroughs on using sf) *
One of the great things about learning R rather than a commercial statistical package is that it’s free and has a large community of users who provide free resources for learners. The downside of this is that the sheer number of resources out there and the vast differences in how people code and teach can be quite overwhelming!
Below are a few resources that I think are helpful in addition to the books and articles used throughout the course. In particular, I would encourage you to use resources like the #TidyTuesday datasets and data from the UK Data Service to download and practice analysing data that you haven’t seen before.
The easiest way to learn applied social science statistics is to embark on a project where you have to use them. I guarantee that if you spend a few hours a week applying the skills you’ve learned in class to a fresh dataset in an independent project, you will retain that knowledge much better (even if it is with a silly example of data!). What’s more, because we are using R scripts, you will have a record of what you did to go back to when you need to do it again in future.
TidyTuesday is a weekly community data tidying, analysis, and visualisation project where R learners and users are encouraged to download, explore, visualise, and model open data and share their results (or seek help!) on social media (though you do not need to do this). There are now nearly four years’ worth of weekly-released datasets that can be revisited!
While this is not a social science-specific resource, many of the datasets submitted are related to social science. Some examples include:
These data are often small in scope (number of variables) and consistent in file type, so can be easier to jump into working with than some of the much larger datasets like those from the UK Data Service.
Tidyverse Style Guide
This is not so much a resource as a guide for some general principles to follow when writing code in R. The tidyverse style guide emphasizes readability of code and following it can make it much easier to troubleshoot any problems. It is also my preferred style which will help me when it comes to answering your questions in the class workshops or drop-in sessions!
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
The UK Data Service is a national data service that provides access to social and economic data from censuses and surveys in the UK. It hosts many of the major surveys used for social science research in the country and internationally, including the Labour Force Survey, the UK Household Longitudinal Survey, the British Social Attitudes Survey, and the Crime Survey for England and Wales.
These are often quite large collections of data and you need to register with the UK Data Service to access and download them. This is an easy process and you can put your reason for downloading any data as learning purposes. You can also access Teaching Datasets, which can be helpful for learning as they tidy the data first (though in a quantitative research career you will have to learn how to do this yourself!)
Remember that you will have to make use of meta documents to know what each variable is. You might also need to use a package like haven to read in data that is only available in SPSS, Stata, or SAS files (I would recommend using Stata files if csv files are not available).
Some links to teaching datasets include:
The two textbooks for this course also contain a wealth of small data examples that you can use alongside their content or independently.
You can find the data for chapters of Brian Fogarty’s Quantitative Social Science Data with R: An introduction on the student companion website here: https://study.sagepub.com/fogarty
You can find the data for chapters of Kosuke Imai’s Quantitative Social Science: An Introduction here: http://qss.princeton.press/student-resources-for-quantitative-social-science/
LEMMA is an online course developed by the University of Bristol. Despite its purpose being to train people to use Multilevel Modelling, it also has some excellent coverage and tutorials for introductory social statistics. It previously only had tutorials using MLWiN, but these have now grown to include tutorials using R and Stata.
To see an overview of the LEMMA course content, click here.
R for Data Science
R for Data Science has grown to be a go-to text for people learning R. Written by Hadley Wickham and Garrett Grolemund, it covers data manipulation and visualisation in detail. It also provides excellent practical advice on workflow and managing data science-related projects in R. Most importantly, this book follows Wickham and colleagues’ ‘tidy’ approach to data science in R, which includes both the tidying of data into standard, rectangular formats, and the use of tidyverse tools, which emphasize the human readability of code.
The book is available for free online.
Lastly, if you want some additional materials for a data scientist’s approach to modelling in R, you might like Julia Silge’s online course, Supervised Machine Learning Case Studies in R, and her associated screencasts. The course and screencasts provide an interesting view on statistical modelling and prediction from the perspective of a data scientist, and you will see some similarities to, and differences from, the approach taken in this module. These materials will also teach you how to use the tidymodels collection of packages.